Abstract

The setting is a stationary, ergodic time series. The challenge is to
construct a sequence of functions, each based on only finite segments
of the past, which together provide a strongly consistent estimator for
the conditional probability of the next observation, given the infinite
past. Ornstein gave such a construction for the case that the values
are from a finite set, and recently Algoet extended the scheme to time
series with coordinates in a Polish space.

The present study relates a different solution to the challenge. The
algorithm is simple and its verification is fairly transparent. Some
extensions to regression, pattern recognition, and on-line forecasting
are mentioned.



1 Introduction 



In this section, we give a brief overview of the situation with respect to
nonparametric inference under the most lenient mixing conditions. Impetus for
this line of study follows Roussas (1969) and Rosenblatt (1970), who
extended ideas in the nonparametric regression literature for i.i.d. variables
to give a theory adequate for showing, for example, that for $\{X_i\}$ a real
Markov sequence, under Doeblin-like assumptions, the obvious kernel
forecaster is an asymptotically normal estimator of the conditional expectation
$E(X_0 \mid X_{-1} = x)$. In the 1980s, there was an explosion of works which showed
consistency in various senses for nonparametric auto-regression and density
estimators under more and more general mixing assumptions (e.g., Castellana
and Leadbetter (1986), Collomb (1985), Györfi (1981), and Masry (1986)).
The monograph by Györfi et al. (1989) gives supplemental information about
nonparametric estimation for dependent series.

Such striving for generality stems from the inconvenience of mixing
conditions: satisfactory statistical tests for verifying them are not available.
Some recent developments have succeeded in disposing of these conditions altogether. In the
Markov case, aside from some smoothness assumptions, it is enough that 
an invariant law exist to get the usual pointwise asymptotic normality of 
kernel regression (Yakowitz (1989)). In case of Harris recurrence but no 
invariant law, one can still attain a.s. pointwise convergence of a nearest- 
neighbor regression algorithm in which the neighborhood is chosen in advance 
and observations continue until a prescribed number of points fall into that 
neighborhood (Yakowitz (1993)). 

Pushing beyond the Markov hypothesis, by a histogram estimate (Györfi
et al. (1989)) or a recursive-type estimator (Györfi and Masry (1990)), one
can infer the marginal density of an ergodic stationary time series provided
only that there exist an absolutely continuous transition density. Here the
limit may have been attained; it is now known (Györfi et al. (1989) and
Györfi and Lugosi (1992), respectively) that without the conditional density
assumption, the histogram estimator and the kernel and recursive kernel
estimates for the marginal density are not generally consistent.

The situation with respect to (auto-) regression is more inclusive for
ergodic, stationary sequences. Following developments by Ornstein (1978) for
the case that the time series values are from a finite set, Algoet (1992, §5),
in a landmark paper, has provided for time series with values in a Polish space
a data-driven distribution function construction
$F_n(x \mid X_{-1}, X_{-2}, \ldots)$ which a.s. converges in distribution to

$$P(X_0 \le x \mid X_{-1}, X_{-2}, \ldots) = P(X_0 \le x \mid X^-),$$

where $X^- = (X_{-1}, X_{-2}, \ldots)$.

The goal of the present study is to relate a simpler rule, the consistency of
which is easy to establish. In concluding sections, it is noted that as a result of
these developments, one has a consistent regression estimate in the bounded
time-series case, and implications for problems of pattern recognition and
on-line forecasting are mentioned. It is to be conceded that our algorithm, as
well as those of Algoet and Ornstein, can be expected to require very large
data segments for acceptable precision.

As a final general comment, we note that the assumption of ergodicity 
may be relaxed somewhat. Thus in view of Sections 7.4 and 8.5 of Gray 
(1988), one sees that a nonergodic stationary process has an ergodic decom- 
position. With probability one, a realization of the time series falls into an 
invariant event on which the process is ergodic and stationary. Then one 
may apply the developments of this study to that event as though it were 
the process universe. Thus the analysis here also remains valid for stationary 
nonergodic processes. Our analysis is restricted to the case that the
coordinates of the time series are real, but it is evident that the proofs extend
directly to the vector-valued case. In view of Theorem 2.2 of Billingsley
(1968, p. 14) it will be clear that the formulas and derivations to follow also
hold if the $X_i$'s are in a Polish space.

2 Estimation of conditional distributions



Let $X = \{X_n\}$ denote a real-valued doubly infinite stationary ergodic time
series. Let

$$X_{-j}^{-1} = (X_{-j}, X_{-j+1}, \ldots, X_{-1})$$

be notation for a data segment into the $j$-past, where $j$ may be infinite. For
a Borel set $C$ one wishes to infer the conditional probability

$$P(C \mid X^-) = P(X_0 \in C \mid X_{-\infty}^{-1}).$$

The algorithm to be promoted here is iterative on an index $k = 1, 2, \ldots$
For each $k$, the data-driven estimate of $P(C \mid X^-)$ requires only a segment of
finite (but random) length of $X^-$. One may proceed by simply repeating the
estimation process for $k = 1, 2, \ldots$, until a given finite data record no longer
suffices for the demands of the algorithm. The goal of the study will be
to show that a.s. convergence can be attained. That is, our estimation is
strongly consistent in the topology of weak convergence.

The estimation algorithm is now revealed in the simple context of binary 
sequences, and afterwards, we show alterations necessary for more general 
processes. 

Define the sequences $\lambda_{k-1}$ and $\tau_k$ recursively ($k = 1, 2, \ldots$). Put $\lambda_0 = 1$
and let $\tau_k$ be the time between the occurrence of the pattern

$$B(k) = (X_{-\lambda_{k-1}}, \ldots, X_{-1}) = X_{-\lambda_{k-1}}^{-1}$$

at time $-1$ and the last occurrence of the same pattern prior to time $-1$.
More precisely, let

$$\tau_k = \min\{t > 0 : X_{-\lambda_{k-1}-t}^{-1-t} = X_{-\lambda_{k-1}}^{-1}\}.$$

Put

$$\lambda_k = \tau_k + \lambda_{k-1}.$$

The observed vector $B(k)$ a.s. takes a value having positive probability; thus
by ergodicity, with probability 1 the string $B(k)$ must appear infinitely often
in the sequence $X_{-\infty}^{-2}$. One denotes the $k$th estimate of $P(C \mid X^-)$ by $P_k(C)$,
and defines it to be

$$P_k(C) = \frac{1}{k} \sum_{1 \le j \le k} 1_C(X_{-\tau_j}). \qquad (1)$$

Here $1_C$ is the indicator function for $C$.
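The recursion for the recurrence times and the estimate (1) can be sketched for a finite binary record as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `recurrence_times` and `estimate` are hypothetical, and the list `past` is assumed to hold the observed segment of the past in time order, so that `past[-1]` plays the role of $X_{-1}$. The loop stops as soon as the record is too short to contain an earlier occurrence of the current pattern.

```python
def recurrence_times(past):
    """Return the recurrence times [tau_1, ..., tau_K] computable from `past`.

    Assumption: past[-i] is the observation X_{-i}, i = 1, ..., len(past).
    """
    n = len(past)
    lam = 1          # lambda_0 = 1
    taus = []
    while True:
        pattern = past[n - lam:]            # B(k) = (X_{-lam}, ..., X_{-1})
        tau = None
        for t in range(1, n - lam + 1):     # smallest shift t > 0 that matches
            if past[n - lam - t:n - t] == pattern:
                tau = t
                break
        if tau is None:                     # record exhausted: stop iterating
            break
        taus.append(tau)
        lam += tau                          # lambda_k = tau_k + lambda_{k-1}
    return taus


def estimate(past, in_C):
    """Average of the indicator of C over the symbols X_{-tau_j}, as in (1)."""
    taus = recurrence_times(past)
    if not taus:
        return None                         # no recurrence found in the record
    return sum(1 for t in taus if in_C(past[-t])) / len(taus)


record = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(recurrence_times(record))
print(estimate(record, lambda x: x == 1))
```

On a short record only a few recurrence times are available, which matches the remark in the introduction that very large data segments are needed for acceptable precision.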

For the general case, we use a sub-sigma-field structure motivated by
Algoet (1992, Section 5.2), which is more general. Let
$\mathcal{P}_k = \{A_{k,i},\ i = 1, 2, \ldots, m_k\}$
be a sequence of finite partitions of the real line by (finite or
infinite) right semi-closed intervals such that $\sigma(\mathcal{P}_k)$ is an increasing sequence
of finite $\sigma$-algebras that asymptotically generate the Borel $\sigma$-field. Let $G_k$
denote the corresponding quantizer:

$$G_k(x) = A_{k,i} \quad \text{if } x \in A_{k,i}.$$

The role of the feature vector in (1) is now played by the discrete quantity

$$B(k) = (G_k(X_{-\lambda_{k-1}}), \ldots, G_k(X_{-1})) = G_k(X_{-\lambda_{k-1}}^{-1}).$$

Now

$$\tau_k = \min\{t > 0 : G_k(X_{-\lambda_{k-1}-t}^{-1-t}) = G_k(X_{-\lambda_{k-1}}^{-1})\}.$$

Again, ergodicity implies that $B(k)$ is almost surely to be found in the
sequence $G_k(X_{-\infty}^{-2})$, and with this generalization of notation, the $k$th estimate
of $P(C \mid X^-)$ is still provided by formula (1).
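One concrete choice of nested partitions, and the corresponding search for recurrence times, might look as follows. This is only a sketch: the dyadic partition used here (cells of width $2^{-k}$ on $(-k, k]$ plus two unbounded end cells) is an assumption chosen for illustration, and the names `quantize` and `recurrence_times` are hypothetical. Any increasing sequence of finite partitions by right semi-closed intervals that generates the Borel sets would serve equally well.

```python
import math


def quantize(x, k):
    """Index of the level-k cell containing x (assumed dyadic partition).

    Cells: (-inf, -k], then (-k + i/2^k, -k + (i+1)/2^k] for
    i = 0, ..., 2k*2^k - 1, then (k, inf). Each level refines the previous one.
    """
    if x <= -k:
        return 0
    if x > k:
        return 2 * k * 2**k + 1
    return math.ceil((x + k) * 2**k)        # right semi-closed cells


def recurrence_times(past, quantizer=quantize):
    """Recurrence times of the quantized patterns; past[-i] stands for X_{-i}."""
    n = len(past)
    lam, k, taus = 1, 1, []
    while True:
        q = [quantizer(x, k) for x in past]  # record seen through G_k
        pattern = q[n - lam:]
        tau = next((t for t in range(1, n - lam + 1)
                    if q[n - lam - t:n - t] == pattern), None)
        if tau is None:                      # record too short for stage k
            break
        taus.append(tau)
        lam += tau
        k += 1                               # next stage uses a finer partition
    return taus


print(recurrence_times([0.3, -0.2, 0.9, 0.4, -0.1, 0.8]))
```

Because the partition is refined at every stage, a match at stage $k$ only requires agreement up to the resolution of $G_k$, which is what makes recurrences possible for real-valued data.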

As in Algoet's construction, the estimate $P_k$ is calculated from observations
of random size. Here the random sample size is $\lambda_k$. To obtain a version with
fixed sample size $t > 0$, let $\kappa_t$ be the maximum of the integers $k$ for which
$\lambda_k \le t$. Put

$$P_{-t}(C) = P_{\kappa_t}(C). \qquad (2)$$

Theorem 1 Under the stationary ergodic assumption regarding $\{X_n\}$ and
under the estimator constructs (1) and (2) described above,

$$\lim_{k \to \infty} P_k(\cdot) = P(\cdot \mid X^-) \quad \text{a.s.,} \qquad (3)$$

and

$$\lim_{t \to \infty} P_{-t}(\cdot) = P(\cdot \mid X^-) \quad \text{a.s.,} \qquad (4)$$

in the weak topology of distributions.

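Proof. To begin with, assume that for some $m$, $C \in \sigma(\mathcal{P}_m)$. The first
chore is to show that a.s.,

$$P_k(C) \to P(C \mid X^-).$$

For $k > m$ we have that

$$
\begin{aligned}
P_k(C) - P(C \mid X^-)
&= \frac{1}{k} \sum_{1 \le j \le m} \Big[ 1_C(X_{-\tau_j}) - P\big(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big) \Big] \\
&\quad + \frac{k-m}{k} \cdot \frac{1}{k-m} \sum_{m < j \le k} \Big[ 1_C(X_{-\tau_j}) - P\big(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big) \Big] \\
&\quad + \frac{1}{k} \sum_{1 \le j \le k} P\big(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big) - P(C \mid X^-) \\
&= P1_k + \frac{k-m}{k} P2_k + P3_k.
\end{aligned}
$$

Obviously,

$$P1_k \to 0 \quad \text{a.s.}$$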

Toward mastering $P2_k$, one observes that $P2_k$ is an average of bounded
martingale differences. To see this note that $\sigma(G_j(X_{-\lambda_j}^{-1}))$, $j = 0, 1, \ldots$, is
monotone increasing, and that $1_C(X_{-\tau_j})$ is measurable on $\sigma(G_j(X_{-\lambda_j}^{-1}))$ for
$j > m$. The convergence of $P2_k$ can be established by Lévy's classical result,
namely, the Cesàro means of a bounded sequence of martingale differences
converge to zero almost surely. For a version suited to our needs, see, for
example, Theorem 3.3.1 in Stout (1974). One may even obtain rates for
$P2_k$ through the use of Azuma's (1967) exponential bound for martingale
differences. It remains to prove that

$$P3_k \to 0 \quad \text{a.s.}$$

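By Lemma 1 in the appendix,

$$P\big(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big) = P\big(X_0 \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big).$$

Using this we get

$$
\begin{aligned}
P3_k &= \frac{1}{k} \sum_{1 \le j \le k} P\big(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big) - P(C \mid X^-) \\
&= \frac{1}{k} \sum_{1 \le j \le k} P\big(X_0 \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big) - P(C \mid X^-).
\end{aligned}
$$

By assumption,

$$\sigma(B(j)) \uparrow \sigma(X^-),$$

which implies that

$$\sigma(G_j(X_{-\lambda_j}^{-1})) \uparrow \sigma(X^-).$$

Consequently by the a.s. martingale convergence theorem we have that

$$P\big(X_0 \in C \mid G_j(X_{-\lambda_j}^{-1})\big) \to P(C \mid X^-) \quad \text{a.s.,}$$

and thus by the Toeplitz lemma (cf. Ash (1972))

$$P3_k \to 0 \quad \text{a.s.}$$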

Let $D$ denote the countably infinite set of $x$'s for which $(-\infty, x] \in \sigma(\mathcal{P}_k)$ for
sufficiently large $k$. By assumption, $D$ is dense in $\mathbb{R}$. Define

$$F_k(x) = P_k((-\infty, x]).$$

Also, set

$$F(x) = P((-\infty, x] \mid X^-).$$

By the preceding development we have the almost sure event $H$ such that
on $H$ for all $x \in D$

$$F_k(x) \to F(x). \qquad (5)$$

Since $D$ is dense in $\mathbb{R}$ and the functions $F_k$ and $F$ are nondecreasing, we
have (5) on $H$ and for all continuity points of $F(\cdot)$,
and (3) is proved. The convergence (4) is an obvious consequence of (3).

3 Estimation of auto-regression functions 

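The next result uses estimators

$$R_k = \frac{1}{k} \sum_{1 \le j \le k} X_{-\tau_j} \qquad (6)$$

and

$$R_{-t} = \frac{1}{\kappa_t} \sum_{1 \le j \le \kappa_t} X_{-\tau_j}. \qquad (7)$$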

Corollary 1 Assume that for some number $D$, a.s., $|X_0| \le D < \infty$. Under
the stationary ergodic assumption regarding $\{X_n\}$ and under the estimator
constructs (6) and (7) described above,

$$\lim_{k \to \infty} R_k = E(X_0 \mid X^-) \quad \text{a.s.,} \qquad (8)$$

and

$$\lim_{t \to \infty} R_{-t} = E(X_0 \mid X^-) \quad \text{a.s.} \qquad (9)$$

Proof. Define the function

$$
\phi(x) =
\begin{cases}
D, & \text{if } x > D, \\
x, & \text{if } -D \le x \le D, \\
-D, & \text{if } x < -D.
\end{cases}
$$

Then

$$
\begin{aligned}
R_k = \int x \, P_k(dx) &= \int \phi(x) \, P_k(dx) \\
&\to \int \phi(x) \, P(dx \mid X^-) = \int x \, P(dx \mid X^-) = E(X_0 \mid X^-),
\end{aligned}
$$

because of Theorem 1 and the fact that convergence in distribution implies
the convergence of integrals of the bounded continuous function $\phi$ with
respect to the actual distributions (Billingsley (1968)). Thus the proof of (8)
is complete. The proof of (9) follows in the same way; just put $P_{-t}$ in place
of $P_k$.

The estimates $R_{-t}$ converge almost surely to $E(X_0 \mid X^-)$ and are
uniformly bounded; since $E(X_0 \mid X_{-t}^{-1}) \to E(X_0 \mid X^-)$ by martingale
convergence, $|R_{-t} - E(X_0 \mid X_{-t}^{-1})| \to 0$ also in mean. Motivated by
Bailey (1976), consider the estimator $R_t(\omega) = R_{-t}(T^t \omega)$, which is defined
in terms of $(X_0, \ldots, X_{t-1})$ in the same way as $R_{-t}(\omega)$ was defined in terms
of $(X_{-t}, \ldots, X_{-1})$. ($T$ denotes the left shift operator.) The estimator
$R_t$ may be viewed as an on-line predictor of $X_t$. This predictor has
special significance not only because of potential applications, but additionally
because Bailey (1976) proved that it is impossible to construct estimators
$R_t$ such that always $R_t - E(X_t \mid X_0^{t-1}) \to 0$ almost surely. An immediate
consequence of Corollary 1 is that convergence in probability is verified.
That is, the shift transformation $T$ is measure preserving, hence convergence
$R_{-t} - E(X_0 \mid X_{-t}^{-1}) \to 0$ in $L^1$ implies convergence
$R_t - E(X_t \mid X_0^{t-1}) \to 0$ in $L^1$ and in probability.
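The on-line use of the estimator can be sketched as follows: at each time $t$ the backward estimate is applied to the prefix observed so far, treating its last entry as the most recent past value. This is a hedged illustration only. It assumes a bounded series taking finitely many numeric values, so that exact pattern matching stands in for the quantizers $G_k$, and the names `backward_estimate` and `online_forecasts` are hypothetical.

```python
def backward_estimate(past):
    """Average of the values that followed earlier occurrences of the growing
    pattern ending at past[-1]; returns None if no recurrence is found.

    Assumption: finitely many distinct values, so patterns are matched exactly.
    """
    n, lam, values = len(past), 1, []
    while True:
        pattern = past[n - lam:]
        tau = next((t for t in range(1, n - lam + 1)
                    if past[n - lam - t:n - t] == pattern), None)
        if tau is None:
            break
        values.append(past[n - tau])     # the symbol that followed the match
        lam += tau
    return sum(values) / len(values) if values else None


def online_forecasts(series):
    """Forecast of series[t] from series[:t], for t = 1, ..., len(series)."""
    return [backward_estimate(series[:t]) for t in range(1, len(series) + 1)]


forecasts = online_forecasts([0, 1] * 8)
print(forecasts[-1])   # forecast of the value following the whole record
```

Recomputing the estimate from scratch at every step keeps the sketch short; an incremental implementation would be preferable for long records.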

4 Pattern recognition 

Consider the 2-class pattern recognition problem with $d$-dimensional feature
vector $X$ and binary valued label $Y$. Let $\mathcal{D}^- = (X_{-\infty}^{-1}, Y_{-\infty}^{-1})$ be the data.
In conventional pattern recognition problems $(X_0, Y_0)$ and $\mathcal{D}^-$ are independent,
so the best possible decision based on $X_0$ and based on $(X_0, \mathcal{D}^-)$ are
the same. Here assume that $\{(X_i, Y_i)\}$ is a doubly infinite stationary and
ergodic sequence. The classification problem is to decide on $Y_0$ for given
data $(X_0, \mathcal{D}^-)$ in order to minimize the probability of misclassification. The
Bayes decision $g^*$ is the best possible one. Let $\eta(X_0, \mathcal{D}^-)$ be the a posteriori
probability of $Y_0 = 1$ (regression function):

$$\eta(X_0, \mathcal{D}^-) = P(Y_0 = 1 \mid X_0, \mathcal{D}^-) = E(Y_0 \mid X_0, \mathcal{D}^-).$$

Then $g^*(X_0, \mathcal{D}^-) = 1$ if $\eta(X_0, \mathcal{D}^-) > 1/2$ and $0$ otherwise. For an arbitrary
approximation $\eta_k = \eta_k(X_0, \mathcal{D}^-)$ put $g_k = g_k(X_0, \mathcal{D}^-) = 1$ if $\eta_k > 1/2$ and
$0$ otherwise. Then it is easy to see (cf. Devroye and Györfi (1985), Chapter
10) that

$$
\begin{aligned}
0 &\le P(g_k \ne Y_0 \mid X_0, \mathcal{D}^-) - P(g^*(X_0, \mathcal{D}^-) \ne Y_0 \mid X_0, \mathcal{D}^-) \\
&\le 2\,|\eta_k - \eta(X_0, \mathcal{D}^-)|.
\end{aligned}
\qquad (10)
$$
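The inequality (10) can be checked numerically on a grid of posterior values. The sketch below is an illustration rather than part of the original argument: the helper names `conditional_error`, `plug_in`, and `excess_error` are hypothetical, and exact rational arithmetic is used so that the boundary case where equality holds is not disturbed by rounding.

```python
from fractions import Fraction


def conditional_error(decision, eta):
    """P(decision != Y_0 | data) when P(Y_0 = 1 | data) equals eta."""
    return 1 - eta if decision == 1 else eta


def plug_in(eta_hat):
    """Decide 1 exactly when the (estimated) posterior exceeds 1/2."""
    return 1 if 2 * eta_hat > 1 else 0


def excess_error(eta, eta_hat):
    """Error of the plug-in rule built on eta_hat minus the Bayes error."""
    return (conditional_error(plug_in(eta_hat), eta)
            - conditional_error(plug_in(eta), eta))


grid = [Fraction(i, 50) for i in range(51)]
violations = [(eta, eta_hat) for eta in grid for eta_hat in grid
              if not 0 <= excess_error(eta, eta_hat) <= 2 * abs(eta_hat - eta)]
print(len(violations))
```

The bound is tight only when the approximation sits exactly at the threshold $1/2$ while the true posterior does not; otherwise the plug-in rule either agrees with the Bayes rule or errs by less than twice the approximation error.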

The estimation is a slight modification of (1). Define the sequences $\lambda_{k-1}$ and
$\tau_k$ recursively ($k = 1, 2, \ldots$). Put $\lambda_0 = 1$ and let $\tau_k$ be the time between the
occurrence of the pattern

$$B(k) = (G_k(X_{-\lambda_{k-1}}), Y_{-\lambda_{k-1}}, \ldots, G_k(X_{-1}), Y_{-1}, G_k(X_0))$$

at time $0$ and the last occurrence of the same pattern in $\mathcal{D}^-$. More precisely,

$$\tau_k = \min\{t > 0 : G_k(X_{-\lambda_{k-1}-t}^{-t}) = G_k(X_{-\lambda_{k-1}}^{0}),\ Y_{-\lambda_{k-1}-t}^{-1-t} = Y_{-\lambda_{k-1}}^{-1}\}.$$

Put

$$\lambda_k = \tau_k + \lambda_{k-1}.$$

The observed vector $B(k)$ a.s. takes a value of positive probability; thus
by ergodicity $B(k)$ has occurred with probability 1. One denotes the $k$th
estimate of $\eta(X_0, \mathcal{D}^-)$ by $\eta_k$, and defines it to be

$$\eta_k = \frac{1}{k} \sum_{1 \le j \le k} Y_{-\tau_j}. \qquad (11)$$
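A minimal sketch of the resulting classifier is given below. It rests on two assumptions that should be kept in mind: the posterior estimate is taken, by analogy with (1), to be the average of the labels $Y_{-\tau_j}$ observed at the recurrence times, and the features are assumed to take finitely many values so that exact matching replaces the quantizers $G_k$. The names `posterior_estimate` and `classify` are hypothetical.

```python
def posterior_estimate(xs, ys, x0):
    """Average of the labels attached to earlier recurrences of the pattern
    (X_{-lam}, Y_{-lam}, ..., X_{-1}, Y_{-1}, X_0).

    Assumptions: xs and ys are lists with xs[-i], ys[-i] standing for X_{-i},
    Y_{-i}; features are finite-valued, so patterns are matched exactly.
    """
    n, lam, labels = len(xs), 1, []
    while True:
        target_x = xs[n - lam:] + [x0]      # features up to and including X_0
        target_y = ys[n - lam:]             # labels up to Y_{-1}
        tau = next((t for t in range(1, n - lam + 1)
                    if xs[n - lam - t:n - t + 1] == target_x
                    and ys[n - lam - t:n - t] == target_y), None)
        if tau is None:
            break
        labels.append(ys[n - tau])          # label Y_{-tau} at the recurrence
        lam += tau
    return sum(labels) / len(labels) if labels else None


def classify(xs, ys, x0):
    """Plug-in decision: 1 if the estimated posterior exceeds 1/2, else 0."""
    eta_hat = posterior_estimate(xs, ys, x0)
    if eta_hat is None:
        return None                         # no recurrence in the record
    return 1 if eta_hat > 0.5 else 0


print(classify(list("ababa"), [0, 1, 0, 1, 0], "b"))
```

Returning `None` when no recurrence exists makes explicit that a short record may not support any decision at all.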

Corollary 2 Under the stationary ergodic assumption regarding the process
$\{(X_n, Y_n)\}$ and under the estimator construct (11) described above,

$$P(g_k \ne Y_0 \mid X_0, \mathcal{D}^-) \to P(g^*(X_0, \mathcal{D}^-) \ne Y_0 \mid X_0, \mathcal{D}^-) \quad \text{a.s.} \qquad (12)$$

Proof. Because of (10), we get (12) from

$$\eta_k \to \eta(X_0, \mathcal{D}^-) \quad \text{a.s.,}$$

the proof of which is similar to the proof of Theorem 1.

Remark. It is also possible to construct a version of this estimate with fixed
sample size $t > 0$ in the same way as in (2) and (7).

5 Appendix 

In the sequel, we use the notation of Section 2. 

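Lemma 1 Under the stationary ergodic assumption regarding $\{X_n\}$, for
$j = 1, 2, \ldots$,

$$P\big(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big) = P\big(X_0 \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\big).$$

Proof. First of all, note that by definition,

$$
\begin{aligned}
\sigma(G_{j-1}(X_{-\lambda_{j-1}}^{-1})) &= \mathcal{F}_{j-1} \\
&= \sigma\big(\{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ \lambda_{j-1} = m\};\ b_{-m}^{-1},\ m = 1, 2, \ldots\big),
\end{aligned}
$$

where $b_{-m}^{-1}$ is an $m$-vector of sets from the finite partition $\mathcal{P}_{j-1}$.
Note also that

$$B = \{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ \lambda_{j-1} = m\}$$

are the (countably many) generating atoms of $\mathcal{F}_{j-1}$, so we have to show that
for any atom $B$ the following equality holds:

$$P(B \cap \{X_{-\tau_j} \in C\}) = P(B \cap \{X_0 \in C\}).$$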
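
$\lambda_{j-1}$ is a stopping time and $B$ is an $m$-dimensional cylinder set, which means that
$b_{-m}^{-1}$ determines whether $\lambda_{j-1} \ne m$ (in which case $B = \emptyset$ and the statement
is trivial) or $\lambda_{j-1} = m$, and then

$$B = \{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1}\}.$$

For $j = 1, 2, \ldots$ let

$$\tilde{\tau}_j = \min\{0 < t : G_j(X_{-\lambda_{j-1}+t}^{-1+t}) = G_j(X_{-\lambda_{j-1}}^{-1})\}.$$

Now

$$
\begin{aligned}
& T^l \big[ B \cap \{\tau_j = l,\ X_{-l} \in C\} \big] \\
&= T^l \big[ \{ G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ G_j(X_{-m-l}^{-1-l}) = G_j(X_{-m}^{-1}), \\
&\qquad\quad G_j(X_{-m-t}^{-1-t}) \ne G_j(X_{-m}^{-1}),\ 0 < t < l,\ X_{-l} \in C \} \big] \\
&= \{ G_{j-1}(X_{-m+l}^{-1+l}) = b_{-m}^{-1},\ G_j(X_{-m}^{-1}) = G_j(X_{-m+l}^{-1+l}), \\
&\qquad G_j(X_{-m-t+l}^{-1-t+l}) \ne G_j(X_{-m+l}^{-1+l}),\ 0 < t < l,\ X_0 \in C \} \\
&= \{ G_{j-1}(X_{-m+l}^{-1+l}) = b_{-m}^{-1},\ G_j(X_{-m}^{-1}) = G_j(X_{-m+l}^{-1+l}), \\
&\qquad G_j(X_{-m+t}^{-1+t}) \ne G_j(X_{-m+l}^{-1+l}),\ 0 < t < l,\ X_0 \in C \} \\
&= \{ G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ G_j(X_{-m}^{-1}) = G_j(X_{-m+l}^{-1+l}), \\
&\qquad G_j(X_{-m+t}^{-1+t}) \ne G_j(X_{-m}^{-1}),\ 0 < t < l,\ X_0 \in C \} \\
&= B \cap \{\tilde{\tau}_j = l,\ X_0 \in C\},
\end{aligned}
$$

where $T$ denotes the left shift operator.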
By stationarity, it follows that

$$
\begin{aligned}
P(B \cap \{X_{-\tau_j} \in C\})
&= \sum_{l=1}^{\infty} P(B \cap \{\tau_j = l,\ X_{-l} \in C\}) \\
&= \sum_{l=1}^{\infty} P\big(T^l [B \cap \{\tau_j = l,\ X_{-l} \in C\}]\big) \\
&= \sum_{l=1}^{\infty} P(B \cap \{\tilde{\tau}_j = l,\ X_0 \in C\}) \\
&= P(B \cap \{X_0 \in C\}),
\end{aligned}
$$

and the proof of Lemma 1 is complete.